Single-cell and spatial transcriptomics data analysis with squidpy in python

Graduate Group in Bioinformatics, University of Assuit, computer science
4/7/2022

Table of Contents

1 Load Pakages

Besides squidpy, we need to import some useful packages.

2 Anndata object

We first load one spatial transcriptomics dataset into Anndata, and then explore the Anndata object a bit for spatial data storage and manipulation. One 10X Genomics Visium dataset will be analyzed with Squidpy in our project

2.1 load 10X Genomics Visium dataset into Anndata

A spatial gene expression dataset of mouse brain serial section 2 (Sagittal-Posterior) collected by Space Ranger 1.1.0. will be analyzed throughout the tutorial. Both the gene expression matrix and spatial imaging data are necessary for the computational analysis.

For illustration purposes,we load data from 10x Genomics, by sc.datasets.visium_sge()function that take sample_id and return anndata object that contain data our sample_id = V1_Mouse_Brain_Sagittal_Posterior_Section_2

The data files we will be using include:

2.2 Explore the Anndata object

A Anndata object serves as a container that contains both data (like the count matrix) ,spatial images and analysis (like dimension reduction or clustering results) for a single-cell dataset.

2.2.1 Dimension of Anndata object

Extract the dimension of the anadata object using shape, shape[0] for nraws, shape[1] for ncols

2.2.2 Feature and observation names

Extract the feature and observation names using . obs.names & var.names

2.2.3 observation-level metadata

observation-level metadata is stored as a data.frame, where each row correspond to one observation (e.g. cell or spot) and each column correspond to one observation-level metadata field. It can be accessed via (adata.obs). Row names in the metadata need to match the raw names of the counts matrix.

2.2.3 object with features-level metadata

feature-level metadata is stored as a data.frame, where each row correspond to one feature (e.g. gene ) and each column correspond to one feature-level metadata field. It can be accessed via (var.obs). Row names in the metadata need to match the column names of the counts matrix.

AnnData is specifically designed for matrix-like data. By this we mean that we have observations (cells), each of which can be represented as -dimensional vectors, where each dimension corresponds to a variable or feature (genes). Both the rows and columns of this matrix are special in the sense that they are indexed

3 Analysis pipeline in squidpy

The general steps to perform preprocessing, dimensiona reduction and clustering for spatial transcriptomics data are quite similar to the scanpy workflow analyzing single-cell RNA sequencing data.

3.1 Preprocess data

A standard preprocessing workflow includes the selection and filtration of cells based on quality control (QC) metrics, data normalization and (optional) scaling, and the detection of highly variable features.

3.1.1 Quality control

A few common QC metrics include

3.1.2 Normalization

The variance of molecular counts expresses spatial heterogeneity which cannot be solely explained by technical noise. we recommends normalization using normalize_total in order to account for technical bias while preserving true biological differences.

3.2 Downstream tasks (dimension reduction, data visualization, cluster annotation, differential expression)

find marker genes

Let us compute a ranking for the highly differential genes in each cluster. For this, by default, the .raw attribute of AnnData is used in case it has been initialized before. The simplest and fastest method to do so is the t-test.

expression of some genes in cluster

We see that cbln1 recapitulates the spatial structure.

3.3 Visualization in spatial coordinates

now take a look at how total_counts and n_genes_by_counts behave in spatial coordinates. We will overlay the circular spots on top of the Hematoxylin and eosin stain (H&E) image provided, using the function sc.pl.spatial.

Before, we performed clustering in gene expression space, and visualized the results with UMAP. By visualizing clustered samples in spatial dimensions, we can gain insights into tissue organization and, potentially, into inter-cellular communication.

Image features

now in this block we calculate image features for each Visium spot and create a obs x features matrix in adata that can then be analyzed together with the obs x gene gene expression matrix using function calculate_image_features()

We can use the extracted image features to compute a new cluster annotation. This could be useful to gain insights in similarities across spots based on image morphology.

3.4 Spatial statistics and graph analysis

we can investigate spatial organization by leveraging spatial and graph statistics in Visium data.

3.4.1 Neighborhood enrichment

Computing a neighborhood enrichment can help us identify spots clusters that share a common neighborhood structure across the tissue. We can compute such score with the following function: squidpy.gr.nhood_enrichment(). In short, it’s an enrichment score on spatial proximity of clusters: if spots belonging to two different clusters are often close to each other, then they will have a high score and can be defined as being enriched. On the other hand, if they are far apart, and therefore are seldom a neighborhood, the score will be low and they can be defined as depleted. This score is based on a permutation-based test,Since the function works on a connectivity matrix, we need to compute that as well. This can be done with squidpy.gr.spatial_neighbors().

3.4.2 Co-occurrence across spatial dimensions

we can visualize cluster co-occurrence in spatial dimensions. This is a similar analysis of the one presented above, yet it does not operate on the connectivity matrix, but on the original spatial coordinates , where p(exp/cond) is the conditional probability of observing a cluster(exp) conditioned on the presence of a cluster (cond) , whereas P(exp) is the probability of observing in the radius size of interest. The score is computed across increasing radii size around each observation (i.e. spots here) in the tissue.

3.4.3 Ligand-receptor interaction analysis

Now we provide a fast re-implementation the popular method CellPhoneDB and extended its database of annotated ligand-receptor interaction pairs with the popular database Omnipath . we run the analysis for all clusters pairs, and all genes using function squidpy.gr.ligrec(). then we visualize the results, filtering out lowly-expressed genes (with the means_range argument) and increasing the threshold for the adjusted p-value (with the alpha argument).

3.5 Identify spatially variable genes

Anndata is able to calculate two metrics in order for the users to identify spatially variable genes, namely Moran’s I statistic Spatial coordinates of the cells are incorporated to identify features with spatial heterogeneity in its expression.

3.5.1 Moran’s I

Moran’s I is a global metric measuring the correlation of gene expression values between local observed values and the average of neighboring values. We can interpret Moran’s I as a spatial autocorrelation metric similar to the Pearson correlation coefficient in the context of spatial statistics. We can either calculate the z-score under normality assumption or perform permutation test to see whether we can reject the null hypothesis of zero spatial autocorrelation.

Top 3 genes with the largest Moran's I

Bottom 3 genes with the smallest Moran's I